A Technical Word and Term Translation Aid using Noisy Parallel Corpora across

Author

  • Pascale Fung
Abstract

Technical term translation is one of the most difficult tasks for human translators because (1) most translators are not familiar with term- and domain-specific terminology and (2) such terms are not adequately covered by printed dictionaries. This paper describes an algorithm for translating technical words and terms from noisy parallel corpora across language groups. Given any word that is part of a technical term in the source language, the algorithm produces a ranked list of candidate matches for it in the target language. Potential translations for the term are compiled from the matched words and are also ranked. We show how this ranked list helps translators with technical term translation. Most algorithms for lexical and term translation focus on Indo-European language pairs, and most use a clean, sentence-aligned parallel corpus without insertion, deletion or OCR noise. Our algorithm is language and character-set independent, and is robust to noise in the corpus. We show that our algorithm requires minimal preprocessing and can obtain technical word translations without sentence boundary identification or sentence alignment, both from the English/Japanese AWK manual corpus, which contains noise arising from text insertions or deletions, and from the English/Chinese HKUST bilingual corpus. We obtain a precision of 55.35% from the AWK corpus for word translation including rare words, counting only the best candidate and direct translations. The precision of the best candidate translation is 89.93% on the HKUST corpus. Potential term translations produced by the program help bilingual speakers achieve a 47% improvement in translating technical terms.
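The precision figures above count only the best-ranked candidate for each word. As a minimal sketch of how such a top-1 precision could be computed, assuming a hypothetical mapping from source words to ranked candidate lists and a hypothetical set of accepted translations (the function name, data layout and toy tokens below are illustrative, not taken from the paper):

```python
from typing import Dict, List, Set


def top1_precision(ranked: Dict[str, List[str]], gold: Dict[str, Set[str]]) -> float:
    """Fraction of evaluated source words whose best-ranked candidate
    is among the accepted translations.

    `ranked` and `gold` are hypothetical structures for illustration:
    ranked[word] is a list of candidates ordered best first,
    gold[word] is the set of translations judged correct.
    """
    # Only score words that have a gold entry and at least one candidate.
    evaluated = [w for w in ranked if w in gold and ranked[w]]
    if not evaluated:
        return 0.0
    correct = sum(1 for w in evaluated if ranked[w][0] in gold[w])
    return correct / len(evaluated)


# Toy data with placeholder target tokens (not real corpus output).
ranked_candidates = {
    "pattern": ["t_pattern", "t_line"],
    "field": ["t_record", "t_field"],  # correct answer is ranked second
    "record": ["t_record"],
    "string": ["t_string", "t_number"],
}
gold_translations = {
    "pattern": {"t_pattern"},
    "field": {"t_field"},
    "record": {"t_record"},
    "string": {"t_string"},
}

precision = top1_precision(ranked_candidates, gold_translations)
print(f"top-1 precision: {precision:.2%}")
```

In this toy run three of the four words have a correct best candidate. A correct translation that appears lower in the ranked list, as for "field", does not count toward this measure, although it can still be useful to a translator who reads the whole list.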

Similar resources

Using Noisy Bilingual Data for Statistical Machine Translation

SMT systems rely on a sufficient amount of parallel corpora to train the translation model. This paper investigates the possibility of using word-to-word and phrase-to-phrase translations extracted not only from clean parallel corpora but also from noisy comparable corpora. Translation results for a Chinese-to-English translation task are given.

A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-parallel Corpora

We present two problems for statistically extracting bilingual lexicons: (1) How can noisy parallel corpora be used? (2) How can non-parallel yet comparable corpora be used? We describe our own work and contribution in relaxing the constraint of using only clean parallel corpora. DKvec is a method for extracting bilingual lexicons from noisy parallel corpora, based on arrival distances of words ...

Extracting a Parallel Corpus from Comparable Documents to Improve Translation Quality in Machine Translation Systems

Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...

Automatic Parallel Fragment Extraction from Noisy Data

We present a novel method to detect parallel fragments within noisy parallel corpora. Isolating these parallel fragments from the noisy data in which they are contained frees us from noisy alignments and stray links that can severely constrain translation-rule extraction. We do this with existing machinery, making use of an existing word alignment model for this task. We evaluate the quality an...

Cross-lingual WSD for Translation Extraction from Comparable Corpora

We propose a data-driven approach to enhance translation extraction from comparable corpora. Instead of resorting to an external dictionary, we translate source vector features by using a cross-lingual Word Sense Disambiguation method. The candidate senses for a feature correspond to sense clusters of its translations in a parallel corpus and the context used for disambiguation consists of the ...

Publication date: 1996